Purpose

This document is based on Gemma’s code.

This is a template for fitting sdmTMB to GOA bottom trawl data. For each species, it fits sdmTMB to life stages (juveniles or adults) to predict CPUE in kg per km\(^{2}\). This predicted CPUE is then scaled up to box area in Atlantis to get biomass per box and the proportion of the total by box, which is what we need to initialise Atlantis. The same code will be run for biomass pool species too, except that those will not be split into juveniles and adults.

This workflow does not include British Columbia: biomasses and/or numbers of individuals for the Atlantis boxes in the Canadian part of the model will be estimated separately by using this code, with some adjustments, on DFO groundfish bottom trawl data.

This workflow is based on the following assumptions:

  1. We use lat, lon and depth as predictors. We do not use environmental covariates as predictors because we are not attempting to explain why fish species are distributed the way they are, but rather we are trying to have sensible generic distributions over the model domain throughout the study period (1984-2019).
  2. We predict over a regular grid. The cell size of this grid is 10 km at the moment for computational efficiency, but this is arbitrary and we may need to test different grid sizes and see how the results change. After some testing with predicting over a grid of 1 point per Atlantis box, I chose to use a regular grid because: (1) an average value of lat and lon, such as a centroid, is difficult to calculate for some of the boxes with crescent shapes; and (2) some boxes are placed over areas where depth changes greatly (the GOA bathymetry is complex), and a single point may fall in or near a much deeper/shallower area within a certain box. While the Atlantis box itself has a constant depth, the nearest node of the SPDE mesh may have been near such a deeper/shallower area, thus skewing the estimate of the biomass index for that particular box.
  3. We are not so interested in accurate predictions for any one year, but rather in representative means of where the fish have been over the last few decades. This code runs a temporal model and takes averages of the estimates at the end.
select <- dplyr::select # make sure we use dplyr's select() in case it is masked by another package

Read data

load("data/cpue.atf.A.Rdata")
race_data <- as_tibble(dat_stage)

This is CPUE data for ATF as I received from Gemma (who in turn got it from Kirstin), except the 10-cm size bins have been aggregated into juveniles (J) and adults (A) based on length at 50% maturity. This aggregation occurred at haul level. There are a lot of fields, so let’s select what we need and drop the rest.

Note: for lat and lon, for now I am using the start values for each tow, but it would be easy to get the coordinates for the midpoint of the tow.
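As a sketch of that midpoint calculation (hypothetical helper, not part of the workflow; a simple average of start and end coordinates, which is adequate for short tows):

```r
# Hypothetical helper: midpoint of a tow as the simple average of start and
# end coordinates. Adequate for short tows; a geodesic midpoint (e.g. with
# geosphere::midPoint) would be more exact over long distances.
tow_midpoint <- function(lat_start, lon_start, lat_end, lon_end) {
  c(lat = (lat_start + lat_end) / 2, lon = (lon_start + lon_end) / 2)
}

tow_midpoint(57.00, -152.00, 57.02, -152.06) # lat 57.01, lon -152.03
```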

fields <- c("YEAR",
            "HAULJOIN",
            "LAT", 
            "LON", 
            "DEPTHR", 
            "SST", 
            "TEMPR", 
            "SPECIES_CODE", 
            "CN",
            "BIOM_KGKM2", 
            "NUM_KM2",
            "STAGE")
  
race_data <- race_data %>% select(all_of(fields)) %>% set_names(c(
  "year",
  "hauljoin",
  "lat",
  "lon",
  "depth",
  "stemp",
  "btemp",
  "species_code",
  "name",
  "biom_kgkm2",
  "num_km2",
  "stage"))

Take a quick look at the data spatially.

# coast for plotting
load("data/goa_coast.Rdata")
coast_sf <- st_as_sf(coast) # turn coast to sf

ggplot()+
  geom_point(data = race_data, aes(lon, lat, colour = log1p(biom_kgkm2)), size = 1.5)+
  scale_colour_viridis_c()+
  geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
  theme_minimal()+
  facet_wrap(~year, ncol = 2)+
  labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons

Take a quick look at time series of total CPUE from raw data

biom_year <- race_data %>% group_by(year) %>% summarise(biom = sum(biom_kgkm2)) # sum raw CPUE; log is applied at plotting only

ggplot(biom_year, aes(year, log(biom)))+
  geom_point()+
  geom_path()+
  theme_minimal()+
  labs(title = paste(race_data$name,"total GOA CPUE from bottom \n trawl survey - stage:", race_data$stage, sep = " "))

The above is across the entire GOA.

Add zeroes for hauls with no catch

Kirstin’s CPUE data includes empty hauls (i.e. zero catches). However, in the step where we aggregate the size bin data into life-stage data (not included in this code), we lose those empty hauls (BIN is NA for a haul with zero catch in Kirstin’s data set, which causes zero catches to be dropped when I join by HAULJOIN). So, we need to add them back to the RACE data here. To do this, I take haul information from the “Haul Descriptions” data set on AKFIN. I then subtract the RACE hauls with catch in race_data from the “Haul Descriptions” list to see which hauls were empty for this particular species / life stage. Then I pad these empty hauls with zero CPUEs and attach them to race_data.

load("data/hauls.Rdata") # as accessible on AKFIN Answers with no further modifications

data_hauls <- levels(factor(race_data$hauljoin))
zero_hauls <- setdiff(levels(factor(hauls$Haul.Join.ID)), data_hauls) # assuming that if there are no records from a haul, the catch in that haul was 0 for this species

# make a data frame to bind by row
zero_catches <- hauls %>% filter(Haul.Join.ID %in% zero_hauls) %>% 
  select(Year, Haul.Join.ID, Ending.Latitude..dd., Ending.Longitude..dd., Bottom.Depth..m., Surface.Temperature..C., Gear.Temperature..C.) %>% 
  mutate(species_code = rep(NA, length(Year)),
         name = rep(NA, length(Year)),
         biom_kgkm2 = rep(0, length(Year)),
         num_km2 = rep(0, length(Year)),
         stage = rep(NA, length(Year))) %>%
  set_names(names(race_data))

# attach by row to race_data
race_data <- rbind(race_data, zero_catches)
# ditch hauls with empty lat or lon
race_data <- race_data %>% filter(!is.na(lat) & !is.na(lon))
# and with NA depths
race_data <- race_data %>% filter(!is.na(depth))

sdmTMB

Create spatial mesh

This is the mesh that the sdmTMB algorithm uses to estimate spatial autocorrelation.

Number of knots (tested on adult ATF): different numbers of knots (e.g. 300, 450, 750) return similar model fits and similar qualitative results, but the correlation of observed vs predicted increases with more knots, suggesting that the finer the SPDE mesh, the more precise the model is. Considerations here would be computation time (750 knots takes ~40 minutes to run) and overfitting. Pearson’s coefficient increases by almost 0.2 from 300 to 750 knots for adult ATF. I would not get too hung up on that. On an intuitive level, a number of knots approaching the number of data points should return a perfect fit, but would the model be useful for predictions outside the data?

Note: SPDE = Stochastic Partial Differential Equations approach. Some material can be found here, but basically it is a way of calculating the position of the mesh knots.

race_spde <- make_mesh(race_data, c("lon", "lat"), n_knots = 450) # usually 450
plot(race_spde)

Check out the distribution of the biomass response variable.

hist(race_data$biom_kgkm2, breaks = 30)

hist(log1p(race_data$biom_kgkm2), breaks = 30)

Proportion of zeroes in percentage.

length(which(race_data$biom_kgkm2 == 0))/nrow(race_data)*100
## [1] 28.02988

Space, time, and depth model.

Try running a model with smooth term for depth. Using 5 knots for the smooth - but this is arbitrary and a range of values could be tested. As a note, I am not scaling depth here. The reason is that depth has a different range in the data and the prediction grid, and thus scaled values have different meaning between the two.

Model type: the distribution of the response variable plotted above should give a sense of what model is most appropriate. CPUE data for many of these species resemble a Tweedie distribution when log-transformed, so we use a Tweedie model with a log link. Some groups may warrant a different model, and this will be evaluated case-by-case depending on convergence issues, distribution of model residuals, and model skill metrics (see below).
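The chunk that fits m_depth is not echoed in this document. A minimal sketch of what the call likely looks like, assuming the sdmTMB formula interface described above (argument names such as `mesh` have changed across sdmTMB versions, so treat this as illustrative rather than the exact call):

```r
# Illustrative sketch only - the actual fitting chunk is not shown in this
# document. Tweedie with log link, a depth smooth with k = 5, year as the
# time variable, and spatial autocorrelation estimated on the mesh built above.
m_depth <- sdmTMB(
  biom_kgkm2 ~ s(depth, k = 5),
  data = race_data,
  mesh = race_spde,
  time = "year",
  family = tweedie(link = "log")
)
```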

Check information on model convergence. From the nlminb help page we know that an integer 0 indicates successful convergence. Additional information on convergence can be checked with m_depth$model$message. According to the original PORT optimization documentation, “Desirable return codes are 3, 4, 5, and sometimes 6”.

if(m_depth$model$convergence == 0){print("The model converged.")} else {print("Check convergence issue.")}
## [1] "The model converged."
m_depth$model$message
## [1] "relative convergence (4)"

Check out model residuals.

race_data$resids <- residuals(m_depth) # randomized quantile residuals
hist(race_data$resids)

And QQ plot.

qqnorm(race_data$resids)
abline(a = 0, b = 1)

Plot the response curve from the depth smooth term.

plot(m_depth$mgcv_mod, rug = TRUE)

Finally, plot the residuals in space. If residuals are constantly larger/smaller in some of the areas, it may be sign that the model is biased and it over/underpredicts consistently for some areas. Residuals should be randomly distributed in space. We need to read in the Atlantis BGM file to do that, as we need the right projection.

Read in BGM and coast.

atlantis_bgm <- read_bgm("data/GOA_WGS84_V4_final.bgm")
race_sf <- race_data %>% st_as_sf(coords = c(x = "lon", y = "lat"), crs = "WGS84") %>% st_transform(crs = atlantis_bgm$extra$projection) # turn to spatial object

ggplot()+
  geom_sf(data = race_sf, aes(color = resids), alpha = .8)+
  scale_color_viridis()+
  geom_sf(data = coast_sf)+
  theme_minimal()+
  labs(title = paste(race_data$name,"model residuals in space - stage:", race_data$stage, sep = " "))+
  facet_wrap(~year, ncol = 2)

Predictions from SDM

Take a grid (which must contain information on the predictors we used to build the model) and predict the biomass index over that grid based on the predictors.

  1. The grid is currently a regular grid with 10-km cell size, but 10 km might not be enough to get prediction points in all boxes - especially for a couple of very small and narrow boxes at the western end of the model domain. Revisit this if necessary, but a finer grid could be difficult to justify compared to the density of the survey data.
  2. The grid covers the entire Atlantis model domain, including the non-dynamic boundary boxes (deeper than 1000 m). The grid at the moment also includes Canada boxes, although predictions for these boxes will not be considered here.

Read in the Atlantis prediction grid (10 km) modified in Atlantis_grid_covars.R (code not included here).

atlantis_boxes <- atlantis_bgm %>% box_sf()

Important: depth in the RACE data is a positive number. Depth in the prediction grid we obtained from the ETOPO rasters is a negative number. When we use depth as a predictor in our regular grid, make sure depth is a positive number for consistency with the model variable, or else everything will be upside-down. This was done in the script that produces the prediction grid, so depth is positive.

load("data/atlantis_grid_depth.Rdata")

#atlantis_grid_depth <- atlantis_grid_depth %>% mutate(depth = -depth) # making sure that depth has the same sign/orientation in the data and prediction grid

# add coordinate columns
atlantis_coords <- atlantis_grid_depth %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) %>%
  st_transform(crs = "+proj=longlat +datum=WGS84") %>% dplyr::select(geometry)

atlantis_grid <- cbind(atlantis_grid_depth, do.call(rbind, st_geometry(atlantis_coords)) %>%
    as_tibble() %>% setNames(c("lon","lat")))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
paste("Positive depths are:", length(which(atlantis_grid$depth>0)), "out of:", nrow(atlantis_grid_depth), sep = " ") # Write out a check that depths are positive (few negatives are OK - they are on land - I'll fix it but it should not matter as island boxes will be boundary boxes in Atlantis so predictions will not matter for those)
## [1] "Positive depths are: 6237 out of: 6259"
# add year column
all_years <- levels(factor(race_data$year))

atlantis_grid <- atlantis_grid[rep(1:nrow(atlantis_grid), length(all_years)),]
atlantis_grid$year <- as.integer(rep(all_years, each = nrow(atlantis_grid_depth)))

Visualise the prediction grid.

coast_tmp <- map("worldHires", regions = c("Canada", "USA"), plot = FALSE, fill = TRUE)
coast_tmp <- coast_tmp %>% st_as_sf() 

atlantis_grid %>% filter(year == 1984) %>%
  st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
  ggplot()+
  geom_sf(size = 0.1)+
  geom_sf(data = coast_tmp)+
  coord_sf(xlim = c(-172,-128), ylim = c(50, 61))+
  theme_minimal()+
  labs(title = "Prediction grid")

Make SDM predictions onto the new data from the depth model, back-transforming from the log link.

predictions_race <- predict(m_depth, newdata = atlantis_grid, return_tmb_object = TRUE)
atlantis_grid$estimates <- exp(predictions_race$data$est) #Back-transforming here

atlantis_grid_sf <- atlantis_grid %>% st_as_sf(coords = c("x", "y"), crs = atlantis_bgm$extra$projection) # better for plots

Not plotting most of Canada because the predictions look terrible (due to not having biomass data from there in this model).

ggplot()+
  geom_sf(data = subset(atlantis_boxes, box_id < 92), aes(fill = NULL))+
  geom_sf(data = subset(atlantis_grid_sf, box_id < 92), aes(color=log1p(estimates)))+ # taking the log for visualisation
  geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
  scale_color_viridis(name = expression(paste("Log(CPUE) kg ", km^-2)))+
  theme_minimal()+
  labs(title = paste(race_data$name,"predicted CPUE - stage:", race_data$stage, sep = " "))+
  facet_wrap(~year, ncol = 2)

Attribute the predictions to their respective Atlantis box, so that we can take box averages.

atlantis_grid_means <- atlantis_grid %>% group_by(year, box_id) %>%
  summarise(mean_estimates = mean(estimates, na.rm = TRUE)) %>% ungroup() 
## `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
# join this with the box_sf file

predictions_by_box <- atlantis_boxes %>% inner_join(atlantis_grid_means, by = "box_id")

See estimates per box for all years. Silence boundary boxes as they throw the scale out of whack (and they do not need predictions).

predictions_by_box <- predictions_by_box %>% rowwise() %>% mutate(mean_estimates = ifelse(isTRUE(boundary), NA, mean_estimates))

ggplot()+
  geom_sf(data = predictions_by_box[predictions_by_box$box_id<92,], aes(fill = log1p(mean_estimates)))+ # taking the log for visualisation
  scale_fill_viridis(name = expression(paste("Log(CPUE) kg ", km^-2)))+
  theme_minimal()+
  geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
  facet_wrap(~year, ncol = 2)+
  labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box - stage:", race_data$stage, sep = " "))

Plot the raw data again for comparison.

ggplot()+
  geom_point(data = race_data, aes(lon, lat, colour = log1p(biom_kgkm2)), size = 1.5, alpha = .5)+ # taking the log for visualisation
  scale_colour_viridis_c(name = expression(paste("Log(CPUE) kg ", km^-2)))+
  geom_polygon(data = coast, aes(x = long, y = lat, group = group), colour = "black", fill = "grey80")+
  theme_minimal()+
  facet_wrap(~year, ncol = 2)+
  labs(title = paste(race_data$name,"CPUE from GOA bottom trawl survey - stage:", race_data$stage, sep = " "))
## Regions defined for each Polygons

Have a look at CPUE by depth. This is rough and quick; keep in mind that most tows happen shallower than 300 m, so the sample is not equal across depths.

ggplot(data = race_data, aes(x = depth, y = log1p(biom_kgkm2), color = log1p(num_km2)))+
  scale_color_viridis()+
  geom_point()+
  theme_minimal()+
  labs(title = "CPUE by depth")

Plot data and predictions distributions. These are the data.

ggplot(data = race_data, aes(x = log1p(biom_kgkm2)))+
  geom_histogram(colour = "black", fill = 'grey80', bins = 30)+
  theme_minimal()

And these are the predictions over the 10 km grid.

ggplot(data = atlantis_grid, aes(x = log1p(estimates)))+
  geom_histogram(colour = "black", fill = 'grey80', bins = 30)+
  theme_minimal()

Mean predictions for the study period

Now calculate means of the predictions for the entire study period. Doing it by taking 1984-2019 averages for each Atlantis box.

means_all_years <- predictions_by_box %>% group_by(box_id, area, boundary) %>% summarise(all_years_kgkm2 = mean(mean_estimates)) %>% ungroup()
## `summarise()` has grouped output by 'box_id', 'area'. You can override using the `.groups` argument.
ggplot()+
  geom_sf(data = means_all_years[means_all_years$box_id < 92,], aes(fill = log1p(all_years_kgkm2)))+ # log for visualisation
  scale_fill_viridis(name = expression(paste("Log(CPUE) kg ", km^-2)))+
  geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
  theme_minimal()+
  labs(title = paste(race_data$name, "mean predicted CPUE by Atlantis box (1984-2019) - stage:", race_data$stage, sep = " "))

Let’s have a look at the variance per box over all years. We use the coefficient of variation, because CPUE varies widely between boxes.

cv_all_years <- predictions_by_box %>% group_by(box_id, area, boundary) %>% summarise(cv = sd(mean_estimates)/mean(mean_estimates)) %>% ungroup()
## `summarise()` has grouped output by 'box_id', 'area'. You can override using the `.groups` argument.
ggplot()+
  geom_sf(data = cv_all_years[cv_all_years$box_id < 92,], aes(fill = cv))+
  scale_fill_viridis(name = "CV of CPUE")+
  geom_sf(data = coast_sf, colour = "black", fill = "grey80")+
  theme_minimal()+
  labs(title = paste(race_data$name, "CV of predicted CPUE by Atlantis box (1984-2019) - stage:", race_data$stage, sep = " "))

A couple of boxes have a pretty high CV. This will change between species.

Let’s see how estimated CPUE changes over time, per box.

predictions_by_box %>% 
  ggplot(aes(x = year,y = mean_estimates))+
  geom_point()+
  geom_line()+
  theme_minimal()+
  facet_wrap(~.bx0, scales = "free", ncol = 8)
## Warning: Removed 224 rows containing missing values (geom_point).
## Warning: Removed 48 row(s) containing missing values (geom_path).

Considerable variation over time. It may be worth assigning more weight to earlier years, although the distributions are supposed to be “generally representative” throughout the simulation, at least when it comes to S1-S4.
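One way that weighting could be sketched (hypothetical: linearly decreasing weights from the earliest to the latest year, base R only):

```r
# Hypothetical sketch: weighted mean of yearly box-level estimates, with
# weights decreasing linearly from the earliest to the latest year.
weighted_box_mean <- function(est, years) {
  w <- rev(seq_along(years)) # earliest year gets the largest weight
  w <- w / sum(w)            # normalise weights to sum to 1
  sum(est * w)
}

weighted_box_mean(est = c(10, 20, 30), years = c(1984, 1990, 1996)) # 16.66667
```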

Model skill

Trying to evaluate model skill by having a look at how well model predictions align with observations.

Since this is a spatially-explicit approach, we need observations and predictions at the same location. We use the locations of all RACE hauls as a prediction grid.

#make a prediction grid from the race data itself
race_grid_tmp <- race_data %>% dplyr::select(lon, lat, depth)

# add year
race_grid <- race_grid_tmp[rep(1:nrow(race_grid_tmp), length(all_years)),]
race_grid$year <- as.integer(rep(all_years, each = nrow(race_grid_tmp)))

# predict on this grid
predictions_at_locations <- predict(m_depth, newdata = race_grid, return_tmb_object = TRUE)
race_grid$predictions <- exp(predictions_at_locations$data$est) # back-transforming here

Now join by year and coordinates to have predictions at the sampling points.

race_corr <- race_data %>% left_join(race_grid, by = c("year", "lat", "lon"))

Observed versus predicted

paste0("Pearson's coef observations vs predictions: ", cor(race_corr$biom_kgkm2, race_corr$predictions, use = "everything", method = "pearson"))
## [1] "Pearson's coef observations vs predictions: 0.662288279639506"

Plot.

ggplot(race_corr, aes(x = log1p(biom_kgkm2), y = log1p(predictions)))+ # log for visualisation
  geom_point(aes(color = depth.y))+
  scale_color_viridis()+
  geom_abline(intercept = 0, slope = 1)+
  theme_minimal()+
  facet_wrap(~year, scales = "free")+
  labs(title = paste(race_data$name, "observed vs predicted CPUE. Stage: ", race_data$stage, sep = " "))

These models often underpredict zeroes, i.e. they predict a catch where there was none. Does this happen randomly in space? Does it have a correlation of some kind with depth?

Plot zero catches from the data and the corresponding predictions. Turn to sf for plotting.

race_corr %>% filter(biom_kgkm2 == 0) %>%
  st_as_sf(coords = c(x = "lon", y = "lat"), crs = 4326) %>%
  ggplot()+
  geom_sf(aes(color = log1p(predictions)))+
  geom_sf(data = coast_sf)+
  scale_color_viridis()+
  theme_minimal()+
  labs(title = "Model predictions at zero-catch locations")+
  facet_wrap(~year, ncol = 2)

For adult ATF with 450 knots and k = 5 in the depth spline, the model predicts non-zero CPUE at zero-catch locations all over the GOA. However, the magnitude of the predicted CPUE does not seem completely random in space. Smaller values are predicted in some years at zero-catch locations on the slope (i.e., at depth). Some of the higher values are predicted east of Kodiak.

What about the relationship between model residuals and depth?

race_data %>%
  ggplot()+
  geom_point(aes(x = depth, y = resids, color = log1p(biom_kgkm2)))+
  geom_hline(yintercept = 0, color = "red", linetype = "dashed")+
  scale_color_viridis()+
  theme_minimal()+
  facet_wrap(~year, ncol = 2)

There does not seem to be an obvious relationship between depth and residuals.

Root Mean Square Error (RMSE)

Calculate RMSE between predicted and observed values.

paste("RMSE:", sqrt(sum((race_corr$predictions - race_corr$biom_kgkm2)^2)/nrow(race_corr)), "kg km-2", sep = " ") ### traditional rmse metric, in units kg km-2
## [1] "RMSE: 8763.22396382643 kg km-2"

Normalised RMSE.

rmse_cv <- sqrt(sum((race_corr$predictions - race_corr$biom_kgkm2)^2)/nrow(race_corr))/(max(race_corr$biom_kgkm2)-min(race_corr$biom_kgkm2))*100 #### normalised rmse, expressed as a % of the range of observed biomass values, sort of approximates a coefficient of variation 
paste("Normalised RMSE:", paste0(rmse_cv, "%"), sep = " ")
## [1] "Normalised RMSE: 1.61702674150508%"

Total biomass and biomass per box

The current estimated CPUE is in kg km\(^{-2}\). So, we just turn that into biomass per box. Remember that the area is in m\(^2\) for the boxes, so we need to divide by 1,000,000.

Do this separately for Alaska and Canada, since we are using different data. Total biomass is calculated only for the respective regions, as is box biomass. Proportional biomass (S1-S4), instead, makes sense only for the entire model domain, so that will be in another script.
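For context, that proportions step amounts to dividing each box biomass by the domain total (made-up numbers below, just to illustrate):

```r
# Made-up box biomasses in kg, just to illustrate the S1-S4 proportion step.
box_biomass <- c(box0 = 100, box1 = 300, box2 = 600)
props <- box_biomass / sum(box_biomass)
props # box0 = 0.1, box1 = 0.3, box2 = 0.6
```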

means_all_years <- means_all_years %>% mutate(biomass = all_years_kgkm2*area*1e-06)

means_alaska <- means_all_years %>% filter(box_id<92)
means_alaska %>% select(box_id, all_years_kgkm2, biomass) %>% st_set_geometry(NULL) %>% kable(align = 'lccc', format = "markdown", 
      col.names = c("Box", "CPUE (kg km-2)", "Biomass (kg)"))
Box CPUE (kg km-2) Biomass (kg)
0 NA NA
2 NA NA
3 123.37086 120908.34
4 639.05798 1330839.59
5 513.20340 1361871.50
6 147.41310 302141.59
7 756.90280 2518863.34
8 NA NA
9 1074.42269 2706435.37
10 472.33562 175111.54
11 NA NA
12 546.99643 2739397.39
13 832.10656 8188943.40
14 75.44966 132550.19
15 644.84143 903491.26
16 253.08246 890024.36
17 1798.06321 7305101.76
18 3125.56073 25298953.36
19 93.40908 126096.01
20 1481.60846 1766194.83
22 291.28098 722483.95
23 1573.90252 2791744.04
24 3826.57944 32150710.16
25 1026.91929 9437559.55
26 659.89025 1544845.46
27 2634.48930 14654032.38
28 1084.24997 3264519.67
29 4505.98115 33419051.25
30 NA NA
31 113.02952 662062.51
32 5516.80865 6178799.11
33 3981.98586 15502470.84
34 9483.03101 42324827.75
35 217.56437 1499490.32
36 6235.61184 4955107.80
37 5220.65641 38688456.59
38 220.66517 818115.78
39 7744.69291 44209688.26
41 4226.05512 2752793.96
42 6087.09577 25552475.55
43 1629.26149 3157723.71
44 626.80968 1096411.93
45 1680.40857 11921189.15
46 81.26325 437334.52
47 1744.91985 9897663.22
48 4923.98189 83701061.29
49 1092.62171 3832212.58
50 1512.79649 6633873.47
51 8338.44893 55323451.08
52 3826.72297 14579928.51
53 NA NA
54 3114.13109 2309367.87
55 2609.79297 29155581.96
56 1337.73739 6793965.04
57 348.85857 40374.29
58 NA NA
59 63.75207 140267.36
60 1548.47210 1434366.85
61 1607.71240 2860299.34
62 NA NA
64 1075.65466 5845489.92
65 596.90258 631244.41
66 2141.24014 2279662.72
67 2257.88144 13541512.65
68 310.57189 254234.08
69 NA NA
70 389.31827 339522.99
71 1425.72086 6893902.26
72 48.60781 90383.82
73 617.06044 960071.65
74 1440.42593 7770681.46
75 952.51443 4060668.61
76 1292.33071 3984541.80
77 70.44240 48157.80
78 1156.51819 3925217.25
79 1650.32951 18706803.44
80 194.29755 181799.71
81 NA NA
82 1923.60976 3471342.73
83 2210.80517 5153961.04
84 482.66408 840782.84
85 66.29093 236168.48
87 1791.80848 2114497.40
88 647.60041 4636571.79
89 NA NA
90 1588.48163 6992828.36
91 483.79507 753294.65

Validation metrics

Let’s produce a table that includes: convergence metrics; Pearson’s correlation coefficient for predicted vs observed; RMSE; and normalised RMSE.

data.frame(m_depth$model$convergence, # convergence
           m_depth$model$message, # more convergence
           cor(race_corr$biom_kgkm2, race_corr$predictions, use = "everything", method = "pearson"), # correlation
           sqrt(sum((race_corr$predictions - race_corr$biom_kgkm2)^2)/nrow(race_corr)),# RMSE
           sqrt(sum((race_corr$predictions - race_corr$biom_kgkm2)^2)/nrow(race_corr))/(max(race_corr$biom_kgkm2)-min(race_corr$biom_kgkm2))*100 # NRMSE
) %>% set_names(c("Convergence","Message","Pearson's correlation","RMSE","NRMSE(%)"))
##   Convergence                  Message Pearson's correlation     RMSE NRMSE(%)
## 1           0 relative convergence (4)             0.6622883 8763.224 1.617027